Distributed Deep Learning Using Synchronous Stochastic Gradient Descent
Abstract
We design and implement a distributed multinode synchronous SGD algorithm without altering hyperparameters, compressing data, or changing algorithmic behavior. We perform a detailed analysis of scaling and identify optimal design points for different networks. We demonstrate scaling of CNNs on hundreds of nodes and present what we believe to be record training throughputs: a VGG-A CNN trained with a minibatch of 512 scales 90X on 128 nodes, while VGG-A and OverFeat-FAST trained with a minibatch of 256 scale 53X and 42X, respectively, on a 64-node cluster. We also demonstrate the generality of our approach via best-in-class 6.5X scaling for a 7-layer DNN on 16 nodes. Finally, we attempt to democratize deep learning by training on an Ethernet-based AWS cluster, showing 14X scaling on 16 nodes.
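For readers unfamiliar with the approach, the core loop of synchronous data-parallel SGD can be summarized in a short sketch. The fragment below is illustrative only, not the paper's implementation: it assumes an MPI cluster reachable through mpi4py, a toy least-squares model, and a 512-sample global minibatch (as in the VGG-A experiment) split evenly across ranks.

# Minimal sketch of synchronous data-parallel SGD (not the paper's code).
# Assumes mpi4py and NumPy, and that the worker count divides 512 evenly.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nworkers = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)
w = np.zeros(10)                      # model replicated on every node
lr = 0.01                             # hyperparameters stay unchanged

for step in range(100):
    # Each worker draws its shard of the 512-sample global minibatch.
    x = rng.standard_normal((512 // nworkers, 10))
    y = x @ np.ones(10)               # synthetic regression targets
    grad_local = 2 * x.T @ (x @ w - y) / len(x)

    # Synchronous step: sum gradients across all nodes, then average,
    # so every replica applies the identical update each iteration.
    grad_global = np.empty_like(grad_local)
    comm.Allreduce(grad_local, grad_global, op=MPI.SUM)
    w -= lr * (grad_global / nworkers)

Because the allreduced gradient is identical on every replica, each node applies the same update it would in a single-node run, which is the "no change in algorithmic behavior" property the abstract emphasizes.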
Similar Resources
Probabilistic Synchronous Parallel
Most machine learning and deep neural network algorithms rely on iterative methods, e.g. Stochastic Gradient Descent (SGD), to optimise their utility/cost functions. In distributed learning, the networked nodes must work collaboratively to update the model parameters, and the way they proceed is referred to as the synchronous parallel design (or barrier control). Synchronous parall...
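The barrier-control idea can be made concrete with a toy predicate. The sketch below is an assumption-laden illustration, not code from the paper (the function name and the staleness bound are invented here): it shows how a bounded-staleness rule generalizes the fully synchronous barrier.

# Toy sketch of barrier control (not from the paper): a worker may proceed
# only if it is at most `staleness_bound` steps ahead of the slowest worker.
# With staleness_bound = 0 this degenerates to fully synchronous (BSP) SGD.
def may_proceed(worker_step, all_worker_steps, staleness_bound=0):
    return worker_step - min(all_worker_steps) <= staleness_bound

# Example: a worker at step 5 must wait if another worker is still at step 3
# under a bound of 1, but may continue under a bound of 2.
print(may_proceed(5, [5, 3, 4], staleness_bound=1))  # False
print(may_proceed(5, [5, 3, 4], staleness_bound=2))  # True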
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. Specifically, we present trends in DNN architectures and th...
How to scale distributed deep learning?
Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS). To minimize training time, the training of a deep neural network must be scaled beyond a single machine to as many machines as possible by distributing the ...
High Throughput Synchronous Distributed Stochastic Gradient Descent
We introduce a new, high-throughput, synchronous, distributed, data-parallel, stochastic-gradient-descent learning algorithm. This algorithm uses amortized inference in a compute-cluster-specific, deep, generative, dynamical model to perform joint posterior predictive inference of the mini-batch gradient computation times of all worker nodes in a parallel computing cluster. We show that a synchro...
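As a rough intuition for why modeling worker-node compute times helps, consider the toy estimator below. It is only a stand-in for the paper's amortized-inference model (the class and method names are hypothetical): it tracks each worker's observed gradient computation times and predicts the duration of a synchronous step, which is gated by the slowest worker.

# Toy illustration (not the paper's generative dynamical model): keep the
# observed gradient computation times of each worker and use their means to
# predict how long a synchronous step will take if we wait for all workers.
import numpy as np

class WorkerTimeModel:
    def __init__(self, nworkers):
        self.times = [[] for _ in range(nworkers)]

    def observe(self, worker, seconds):
        self.times[worker].append(seconds)

    def predicted_step_time(self):
        # A synchronous step waits for the slowest worker, so the predicted
        # step time is the max of the per-worker mean estimates.
        return max(np.mean(t) for t in self.times if t)

model = WorkerTimeModel(4)
for w, t in [(0, 1.0), (1, 1.1), (2, 0.9), (3, 2.5)]:  # worker 3 straggles
    model.observe(w, t)
print(model.predicted_step_time())  # 2.5: the straggler dominates the step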
MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster
In this paper, we present a co-designed petascale high-density GPU cluster to expedite distributed deep learning training with synchronous Stochastic Gradient Descent (SSGD). The architecture of our heterogeneous cluster is inspired by the Harvard architecture. According to their different roles in the system, nodes are configured with different specifications. Based on the topology of the whole system's ...
Journal: CoRR
Volume: abs/1602.06709
Issue: -
Pages: -
Publication date: 2016